I chose this dataset because it has the most variables, so I can explore more than with the other datasets. I am also interested in loans. How does the bank evaluate borrowers? How does it record each loan? How can we understand each loan deal? I want to find the answers. The dataset includes about 113k observations, which is large enough to give the statistics some weight.
I plotted LoanStatus as a bar chart, but few observations fall into each “Past Due” category, so those bars are hardly visible.
I think it would be more meaningful to merge the Past Due categories into a single bar.
##              Cancelled             Chargedoff              Completed
##                      5                  11992                  38074
##                Current              Defaulted FinalPaymentInProgress
##                  56576                   5018                    205
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days)
##                     16                    806                    265
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days)
##                    363                    313                    304
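A minimal sketch of the merge, using only the counts printed in the table above. The suffix-stripping pattern and the names `status_counts`, `merged_label` and `merged_counts` are my own illustration, not necessarily the code used for the plot.

```r
# Counts copied from the LoanStatus table above.
status_counts <- c(
  "Cancelled" = 5, "Chargedoff" = 11992, "Completed" = 38074,
  "Current" = 56576, "Defaulted" = 5018, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Strip the " (... days)" suffix so every Past Due bucket shares one label.
merged_label <- sub(" \\(.*\\)$", "", names(status_counts))

# Sum the counts within each merged label.
merged_counts <- tapply(status_counts, merged_label, sum)
print(merged_counts)
```

On the full data frame, the same `sub()` call could be applied to `as.character(df$LoanStatus)` before plotting.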
I think there may be a better choice for the y-axis than a raw count. A percentage would better illustrate the proportion of each loan status.
As I understand it, a borrower's employment status has a great influence on the loan application. Employment status indicates the stability and size of someone's income, which is essential for evaluating a borrower.
##                    Employed     Full-time Not available  Not employed
##          2255         67322         26355          5347           835
##         Other     Part-time       Retired Self-employed
##          3806          1088           795          6134
Again, I think it would be better to show percentages on the y-axis instead of counts.
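A rough sketch of the count-to-percentage conversion, using the EmploymentStatus counts from the table above. The label `(blank)` for the empty category and the variable names are assumptions made for this example.

```r
# Counts copied from the EmploymentStatus table above.
# "(blank)" is a placeholder name for the category with an empty label.
employment_counts <- setNames(
  c(2255, 67322, 26355, 5347, 835, 3806, 1088, 795, 6134),
  c("(blank)", "Employed", "Full-time", "Not available", "Not employed",
    "Other", "Part-time", "Retired", "Self-employed")
)

# prop.table() divides each count by the total; multiply by 100 for percent.
employment_pct <- round(100 * prop.table(employment_counts), 1)
print(employment_pct)
```

These percentages can then be used as bar heights in place of the raw counts.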
##
## 12 36 60
## 1614 87778 24545
##
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
First, I try to get familiar with these variables. The most common loan amounts are around 3000, 10000, 15000 and 20000.
After I cut the variable into intervals, the picture is clearer. Nearly half of the loans are between $1,000 and $5,000. Only a small number of loans exceed $25,000.
A loan applicant's bankcard credit is a strong indication of how much the bank trusts them. I am also curious about each borrower's home ownership, so I facet by that variable. However, the x-axis is unsuitable: the range is too wide for most observations.
Instead of the raw credit, I then use its log10 value. The difference between homeowners and non-homeowners is more obvious. Broadly speaking, homeowners have more credit than non-homeowners. The order-of-magnitude distribution is more apparent on this scale. It is worth pointing out that some borrowers do not have any credit.
This shows how many borrowers have delinquencies.
The previous plot is too coarse. We can see more detail if we zoom in. According to the plot, most borrowers have no delinquencies.
LoanOriginalAmount: Min. 1000, 1st Qu. 4000, Median 6500, Mean 8337, 3rd Qu. 12000, Max. 35000
Term: 12 (1614), 36 (87778), 60 (24545)
AvailableBankcardCredit: Min. 0, 1st Qu. 880, Median 4100, Mean 11210, 3rd Qu. 13180, Max. 646300, NA’s 7544
LoanStatus: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, Past Due
EmploymentStatus: Employed, Full-time, Not available, Not employed, Other, Part-time, Retired, Self-employed
ProsperRating..Alpha.: AA, A, B, C, D, E, HR
LoanOriginalAmount, IsBorrowerHomeowner, AvailableBankcardCredit, StatedMonthlyIncome, Occupation, EmploymentStatus
LoanOriginationDate, Term, MonthlyLoanPayment, BorrowerAPR, LP_ServiceFees, ProsperRating
Yes:
df$LoanOriginalAmount_cut <- cut(df$LoanOriginalAmount, c(0, 1000, 4000, 6500, 8500, 12000, 18000, 25000, 35000))
I log-transformed AvailableBankcardCredit. Because the range of bankcard credit is so large, a plot on the raw scale is not informative. The log-transformed plot shows the order-of-magnitude differences in credit.
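A small sketch of what the log10 transform does to the extreme values, using the minimum, quartiles and maximum from the AvailableBankcardCredit summary plus one missing value. The `+ 1` shift is an assumption shown only as one common way to keep zero-credit borrowers on the plot; I am not claiming it is the exact transform used here.

```r
# Values taken from the AvailableBankcardCredit summary, plus one NA.
credit <- c(0, 880, 4100, 13180, 646300, NA)

# Plain log10: a credit of 0 maps to -Inf and NA stays NA.
log_credit <- log10(credit)

# Assumption: shifting by 1 maps a credit of 0 to 0 instead of -Inf.
log_credit_shifted <- log10(credit + 1)

print(data.frame(credit, log_credit, log_credit_shifted))
```

The raw values span almost six orders of magnitude, which is why the untransformed axis squeezes most observations into a narrow strip.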
## Source: local data frame [10 x 2]
##
## Occupation n
## (fctr) (int)
## 1 Other 28617
## 2 Professional 13628
## 3 Computer Programmer 4478
## 4 Executive 4311
## 5 Teacher 3759
## 6 Administrative Assistant 3688
## 7 Analyst 3602
## 8 3588
## 9 Sales - Commission 3446
## 10 Accountant/CPA 3233
The incomes for the occupations of interest are shown with boxplots.
This plot shows the relationship between median income and mean credit for different occupations. The linear regression has a positive slope, and the intercept is around $3000.
BorrowerAPR clearly increases as the rating goes down.
##
## Pearson's product-moment correlation
##
## data: df.Credit_and_income_by_Occupation_owner.long$median_income and df.Credit_and_income_by_Occupation_owner.long$mean_Credit
## t = 15.342, df = 133, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7288239 0.8530914
## sample estimates:
## cor
## 0.7993492
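As a sanity check, the t statistic and the confidence interval in the output above can be recomputed from the reported correlation and degrees of freedom alone. This is a sketch of the standard formulas (t from r, and a Fisher z interval); the variable names are my own.

```r
r   <- 0.7993492  # sample correlation reported above
dof <- 133        # degrees of freedom reported above
n   <- dof + 2    # number of (median_income, mean_Credit) pairs

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(dof) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

Both results should land close to the values printed by the test, which confirms that the interval is based on 135 occupation and ownership groups rather than on the individual loans.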
Generally speaking, LoanOriginalAmount tends to be larger as the ProsperRating gets higher. The loan amount shrinks sharply when the rating drops from C to D. The gaps between the other ratings are not as large.
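Comparing amounts across ratings only reads well if the ratings are kept in their natural order rather than alphabetical order. A minimal sketch, assuming the order AA to HR from the variable summary; the data frame `toy` and its values are made up for illustration and are not taken from the dataset.

```r
# Rating order as listed in the variable summary (best to worst).
rating_levels <- c("AA", "A", "B", "C", "D", "E", "HR")

# Hypothetical toy data, NOT taken from the Prosper dataset.
toy <- data.frame(
  ProsperRating = c("HR", "AA", "C", "AA", "D", "C", "HR", "D"),
  LoanOriginalAmount = c(3000, 15000, 10000, 12000, 5000, 9000, 4000, 6000)
)

# An ordered factor keeps groups in rating order in tables and plots.
toy$ProsperRating <- factor(toy$ProsperRating, levels = rating_levels, ordered = TRUE)

# Median amount per rating; ratings absent from the toy data come back as NA.
median_by_rating <- tapply(toy$LoanOriginalAmount, toy$ProsperRating, median)
print(median_by_rating)
```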
Borrowers who own a home generally have higher credit and stated income than those who do not. The stated monthly income of borrowers varies a lot among occupations. Even within the same occupation, the variance is large for some occupations. Some outliers reach 3 or 4 times the mean income.
### What was the strongest relationship you found?
Across occupations, mean credit and median income are strongly related. The correlation between them is as high as 0.799. Borrowers also have to pay more interest if the loan has a lower ProsperRating.
Most of the points are densely located in the lower region and would definitely benefit from a different scale. We can also remove the outliers at the top of the plot.
This looks much better.
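One way to drop the top outliers is to keep only values below a chosen quantile. The sketch below uses made-up numbers and a 90% cutoff purely for illustration; the actual variable and threshold behind the plot are not assumed.

```r
# Hypothetical values with one extreme outlier, NOT taken from the dataset.
values <- c(2500, 3000, 3200, 4100, 4600, 5200, 6000, 7500, 9000, 150000)

# Cutoff at the 90th percentile (an arbitrary choice for this example).
cutoff <- quantile(values, 0.90)

# Keep only the observations at or below the cutoff.
trimmed <- values[values <= cutoff]

print(cutoff)
print(range(trimmed))
```

Filtering changes the data that any later summary sees, so an alternative is to leave the data untouched and only restrict the visible axis range.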
DebtToIncomeRatio is strongly related to IncomeRange. People with a DebtToIncomeRatio higher than 1.25 are mostly in the $1-24,999 income range or not employed. Most people with a part-time job have a low IncomeRange. People with a DebtToIncomeRatio higher than 10 are all in the $1-24,999 income range or not employed.
The borrower's annual percentage rate (BorrowerAPR) is closely related to the ProsperRating of the loan. Loans with a higher rating tend to have a lower APR, while borrowers of low-rated loans have to pay more interest annually. The interest and fees of most loans are less than 5000.
There is only a slight difference in loan amount among the Prosper ratings “AA”, “A”, “B” and “C”. But when the rating gets lower, the amount drops steeply. We can also see that the amounts for ratings “E” and “HR” are not very uniformly distributed.
The available credit for most people is between $1000 and $12000. From the plot, we can see that about 5000 borrowers have no available credit. Generally speaking, borrowers who own a home have higher available credit.
Employment status is strongly related to income range. People in the range above $100,000 are almost all employed, full-time employed or self-employed. People with a part-time job are mostly within the range “1-24999”. People whose DebtToIncomeRatio is higher than 1.25 are mostly within the income range “1-24999” or not employed.
The biggest challenge I met in this report is that I was not familiar with the variables. I first had to figure out the meaning of the variables and then pick some interesting ones. I did not know the relationships between the variables of interest, so I had to guess their potential connections and choose a corresponding plot type to examine each hypothesis, over and over again. In the end, I got some meaningful results according to my understanding. Another problem arose when I tried to use bankcard credit as the x-axis value: the range is so large. Instead, I used the log10 value of the credit, and the plot made more sense. The limitations of my exploration are:
Even though the dataset is large, some of the results may still be influenced by chance.
I did not build a cleaning process to remove the outliers.
Some critical information about the borrowers is not included in the dataset, for example age, sex, education level, height and weight.
For future work, if we can get more information about each borrower, we can try to build a mathematical model to predict the LoanOriginalAmount or BorrowerAPR for each observation. If we can get data about applicants who did not get a loan, we can try to fit a decision tree that approximates the lenders' decisions.